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Abstract 

Structural equation models and Bayesian networks have been widely 
used to analyze causal relations between continuous variables. In such 
frameworks, linear acyclic models are typically used to model the data- 
generating process of variables. Recently, it was shown that use of non- 
Gaussianity identifies the full structure of a linear acyclic model, i.e., a 
causal ordering of variables and their connection strengths, without using 
any prior knowledge on the network structure, which is not the case with 
conventional methods. However, existing estimation methods are based on 
iterative search algorithms and may not converge to a correct solution in 
a finite number of steps. In this paper, we propose a new direct method 
to estimate a causal ordering and connection strengths based on non- 
Gaussianity. In contrast to the previous methods, our algorithm requires 
no algorithmic parameters and is guaranteed to converge to the right 
solution within a small fixed number of steps if the data strictly follows 
the model. 
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1 Introduction 



Many empirical sciences aim to discover and understand causal mechanisms 
underlying various natural phenomena and human social behavior. An effec- 
tive way to study causal relationships is to conduct a controlled experiment. 
However, performing controlled experiments is often ethically impossible or too 
expensive in many fields including social sciences [T], bioinformatics [2] and 
neuroinformatics [3]. Thus, it is necessary and important to develop methods 
for causal inference based on the data that do not come from such controlled 
experiments. 

Structural equation models (SEM) [1] and Bayesian networks (BN) [HE] are 
widely applied to analyze causal relationships in many empirical studies. A 
linear acyclic model that is a special case of SEM and BN is typically used to 
analyze causal effects between continuous variables. Estimation of the model 
commonly uses only the covariance structure of the data and in most cases can- 
not identify the full structure, i.e., a causal ordering and connection strengths, 
of the model with no prior knowledge on the structure [4j[5] . 

In [3] , a non-Gaussian variant of SEM and BN called a linear non-Gaussian 
acyclic model (LiNGAM) was proposed, and its full structure was shown to be 
identifiable without pre-specifying a causal order of the variables. This feature 
is a significant advantage over the conventional methods [4,5 . A non-Gaussian 
method to estimate the new model was also developed in [6] and is closely re- 
lated to independent component analysis (ICA) [7J. In the subsequent studies, 
the non-Gaussian framework has been extended in various directions for learn- 
ing a wider variety of SEM and BN [&MlOj. In what follows, we refer to the 
non-Gaussian model as LiNGAM and the estimation method as ICA-LiNGAM 
algorithm. 

Most of major ICA algorithms including [TTIfT^] are iterative search methods 
[7J. Therefore, the ICA-LiNGAM algorithms based on the ICA algorithms need 
some additional information including initial guess and convergence criteria. 
Gradient-based methods [IT] further need step sizes. However, such algorithmic 
parameters are hard to optimize in a systematic way. Thus, the ICA-based 
algorithms may get stuck in local optima and may not converge to a reasonable 
solution if the initial guess is badly chosen [T5] . 

In this paper, we propose a new direct method to estimate a causal ordering 
of variables in the LiNGAM without prior knowledge on the structure. The 
new method estimates a causal order of variables by successively reducing each 
independent component from given data in the model, and this process is com- 
pleted in steps equal to the number of the variables in the model. It is not based 
on iterative search in the parameter space and needs no initial guess or similar 
algorithmic parameters. It is guaranteed to converge to the right solution within 
a small fixed number of steps if the data strictly follows the model, i.e., if all the 
model assumptions are met and the sample size is infinite. These features of the 
new method enable more accurate estimation of a causal order of the variables 
in a disambiguated and direct procedure. Once the causal orders of variables is 
identified, the connection strengths between the variables are easily estimated 
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using some conventional covariance-based methods such as least squares and 
maximum likelihood approaches pQ. We also show how prior knowledge on the 
structure can be incorporated in the new method. 

The paper is structured as follows. First, in Section we briefly review 
LiNG AM and the IC A-bascd LiNGAM algorithm. We then in Section H intro- 
duce a new direct method. The performance of the new method is examined 
by experiments on artificial data in Section [4J and experiments on real- world 
data in Section [5] Conclusions are given in Section [SJ Preliminary results were 
presented in [TlHTo] . 

2 Background 

2.1 A linear non-Gaussian acyclic model: LiNGAM 

In [BJ, a non-Gaussian variant of SEM and BN, which is called LiNGAM, was 
proposed. Assume that observed data are generated from a process represented 
graphically by a directed acyclic graph, i.e., DAG. Let us represent this DAG 
by a mxm adjacency matrix B={6y } where every bij represents the connection 
strength from a variable Xj to another Xi in the DAG. Moreover, let us denote 
by k(i) a causal order of variables Xi in the DAG so that no later variable 
determines or has a directed path on any earlier variable. (A directed path 
from Xi to Xj is a sequence of directed edges such that Xj is reachable from Xj.) 
We further assume that the relations between variables are linear. Without loss 
of generality, each observed variable x\ is assumed to have zero mean. Then we 
have 



where e\ is an external influence. All external influences e, are continuous 
random variables having non-Gaussian distributions with zero means and non- 
zero variances, and arc independent of each other so that there is no latent 
confounding variables [5]. 

Wc rewrite the model (p} in a matrix form as follows: 



where x is a p-dimensional random vector, and B could be permuted by simul- 
taneous equal row and column permutations to be strictly lower triangular due 
to the acyclicity assumption pQ. Strict lower triangularity is here defined as a 
lower triangular structure with all zeros on the diagonal. Our goal is to estimate 
the adjacency matrix B by observing data x only. Note that we do not assume 
that the distribution of x is faithful [5] to the generating graph. 

We note that each bij represents the direct causal effect of Xj on Xi and each 
Ojj, the (i, i)-the element of the matrix A=(I — B) _1 , the total causal effect of 
Xj on Xi [17] . 

We emphasize that Xi is equal to if no other observed variable Xj (j^i) 
inside the model has a directed edge to Xi, i.e., all the fey (j^i) are zeros. In 




(1) 



kU)<k(i) 



x = Bx + e, 



(2) 
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such a case, an external influence e, is observed as Xi. Such an Xi is called 
an exogenous observed variable. Otherwise, is called an error. For example, 
consider the model defined by 

x 2 = e 2 

x\ = 1.5x2 + ei 

x 3 = 0.8x1 — 1.5x2 + e 3 , 

where £2 is equal to e 2 since it is not determined by either x\ or X3. Thus, x% 
is an exogenous observed variable, and e\ and e 3 are errors. Note that there 
exists at least one exogenous observed variable Xi(=ei) due to the acyclicity and 
the assumption of no latent confounders. 

An exogenous observed variable is usually defined as an observed variable 
that is determined outside of the model [T|. In other words, an exogenous ob- 
served variable is a variable that any other observed variable inside the model 
does not have a directed edge to. The definition does not require that it is equal 
to an independent external influence, and the external influences of exogenous 
observed variables may be dependent. However, in the LiNGAM ([2]), an exoge- 
nous observed variable is always equal to an independent external influence due 
to the assumption of no latent confounders. 

2.2 Identifiability of the model 

We next explain how the connection strengths of the LiNGAM ([2]) can be iden- 
tified as shown in [5J. Let us first solve Eq. @ for x. Then we obtain 

x = Ae, (3) 

where A = (I — B) _1 is a mixing matrix whose elements are called mixing 
coefficients and can be permuted to be lower triangular as well due to the afore- 
mentioned feature of B and the nature of matrix inversion. Since the compo- 
nents of e are independent and non-Gaussian, Eq. ([3]) defines the independent 
component analysis (ICA) model [7] , which is known to be identifiable fL8l[T9] . 

ICA essentially can estimate A (and W = A -1 = I B), but has per- 
mutation, scaling and sign indetcrminacies. ICA actually gives W/ca=PDW, 
where P is an unknown permutation matrix, and D is an unknown diagonal 
matrix. But in LiNGAM, the correct permutation matrix P can be found [BJ: 
the correct P is the only one that gives no zeros in the diagonal of DW since 
B should be a matrix that can be permuted to be strictly lower triangular and 
W = I B. Further, one can find the correct scaling and signs of the indepen- 
dent components by using the unity on the diagonal of W=I— B. One only has 
to divide the rows of DW by its corresponding diagonal elements to obtain W. 
Finally, one can compute the connection strength matrix B = I — W. 

2.3 ICA-LiNGAM algorithm 

The ICA-LiNGAM algorithm presented in [6J is described as follows: 
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ICA-LiNGAM algorithm 

1. Given a p-dimensional random vector x and its p x n observed data matrix 
X, apply an ICA algorithm (FastICA using hyperbolic tangent function |12j ) 
to obtain an estimate of A. 

2. Find the unique permutation of rows of W=A _1 which yields a matrix W 
without any zeros on the main diagonal. The permutation is sought by mini- 
mizing Y,i i/|Wii|. 

3. Divide each row of W by its corresponding diagonal element, to yield a new 
matrix W' with all ones on the diagonal. 

4. Compute an estimate B of B using B = I W. 

5. Finally, to estimate a causal order k(i), find the permutation matrix P of B 
yielding a matrix B = PBP T which is as close as possible to a strictly lower 
triangular structure. The lower-triangularity of B can be measured using the 
sum of squared bij in its upper triangular part Y2i<j ^% f° r sma H number of 
variables, say less than 8. For higher-dimensional data, the following approx- 
imate algorithm is used, which sets small absolute valued elements in B to 
zero and tests if the resulting matrix is possible to be permuted to be strictly 
lower triangular: 

(a) Set the p(p + l)/2 smallest (in absolute value) elements of B to zero. 

(b) Repeat 

i. Test if B can be permuted to be strictly lower triangular. If the 
answer is yes, stop and return the permuted B, that is, B. 

ii. Additionally set the next smallest (in absolute value) element of B 
to zero. 



2.4 Potential problems of ICA-LiNGAM 

The original ICA-LiNGAM algorithm has several potential problems: i) Most 
ICA algorithms including FastICA [12] and gradient-based algorithms [11] may 
not converge to a correct solution in a finite number of steps if the initially 
guessed state is badly chosen [13] or if the step size is not suitably selected for 
those gradient-based methods. The appropriate selection of such algorithmic 
parameters is not easy. In contrast, our algorithm proposed in the next sec- 
tion is guaranteed to converge to the right solution in a fixed number of steps 
equal to the number of variables if the data strictly follows the model, ii) The 
permutation algorithms in Steps [2] and [5] are not scale- invariant. Hence they 
could give a different or even wrong ordering of variables depending on scales 
or standard deviations of variables especially when they have a wide range of 
scales. However, scales are essentially not relevant to the ordering of variables. 
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Though such bias would vanish for large enough sample sizes, for practical sam- 
ple sizes, an estimated ordering could be affected when variables are normalized 
to make unit variance for example, and hence the estimation of a causal ordering 
becomes quite difficult. 

3 A direct method: DirectLiNGAM 

3.1 Identification of an exogenous variable based on non- 
Gaussianity and independence 

In this subsection, we present two lemmas and a corollarjQ that ensure the 
validity of our algorithm proposed in the next subsection 13.21 The basic idea 
of our method is as follows. We first find an exogenous variable based on its 
independence of the residuals of a number of pairwise regressions (Lemma [T]). 
Next, we remove the effect of the exogenous variable from the other variables 
using least squares regression. Then, we show that a LiNGAM also holds for 
the residuals (Lemma [5]) and that the same ordering of the residuals is a causal 
ordering for the original observed variables as well (Corollary [TJ. Therefore, we 
can find the second variable in the causal ordering of the original observed vari- 
ables by analyzing the residuals and their LiNGAM, i.e., by applying Lemma [I] 
to the residuals and finding an "exogenous" residual. The iteration of these 
effect removal and causal ordering estimates the causal order of the original 
variables. 

Wc first quote Darmois-Skitovitch theorem [20j[2T] since it is used to prove 
Lemma [I] 

Theorem 1 (Darmois-Skitovitch theorem) Define two random variables y\ 
and ij2 as linear combinations of independent random variables Si(i=l, ■ ■ ■ , q): 

Vi=^2^iSt, y2=^fts t - (4) 

i=l i=l 

Then, if yi and y<x are independent, all variables Sj for which ctjftj ^ are 
Gaussian, rj 

In other words, this theorem means that if there exists a non-Gaussian Sj for 
which a.jf3j^0, y\ and ?/2 are dependent. 

Lemma 1 Assume that the input data x strictly follows the LiNGAM Q). De- 
note by rf' the residuals when x% are regressed on xy. = a?j — ~^^ry~ x j 
(i =/= j). Then a variable Xj is exogenous if and only if xj is independent of its 
residuals for all i ^ j. rj 

We prove the lemmas and corollary without assuming the faithfulness [5] unlike our pre- 
vious work 1141 . 
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Proof (i) Assume that Xj is exogenous, i.e., Xj=€j. Due to the model assump- 
tion and Eq. @, one can write Xi=aijXj+e < f * > (i^j), where =y~] h ,. ajh&h 
and Xj are independent, and a.y is a mixing coefficient from Xj to Xi in Eq. J3]). 
The mixing coefficient is equal to the regression coefficient when xi is re- 
gressed on Xj since cov(x.;, Xj)=a,ijvaz{xj). Thus, the residual r^' is equal to the 

corresponding error term, i.e., r^=e^. This implies that Xj and r^\=ef') 
arc independent. 

(ii) Assume that Xj is not exogenous, i.e., Xj has at least one parent. Let Pj 
denote the (non-empty) set of the variable subscripts of parent variables of Xj. 
Then one can write Xj = E^gp ^jh x h + e ji where Xh and ej are independent 
and each bjh is non-zero. Let a vector xp. and a column vector hp j collect 
all the variables in Pj and the corresponding connection strengths, respectively. 
Then, the covariances between xp. and Xj HXC 

E (xp 3 Xj ) = E{x Pj (b£. x Pj + ej )} 

= E(x Pj bl.x Pj ) + E(x Pj ej) 

= S(x P .x?.)bp r (5) 

The covariance matrix E^xp^.Xp. ) is positive definite since the external influ- 
ences eh that correspond to those parent variables Xh in Pj are mutually inde- 
pendent and have positive variances. Thus, the covariance vector E(jx.p j Xj) = 
E(yLp j xJ,.)bp j in Eq. (|5|) cannot equal the zero vector, and there must be at 
least one variable Xi (i € Pj) with which Xj covaries, i.e., cov(xi, Xj)^0. Then, 
for such a variable Xi (i £ Pj) that cov(xi, Xj)j^0, we have 

Jj) _ _ Covin,!,) 

1 - 1 varfo) J 



cov(xj,Xj) 
vax(xj) 



b i hXh 

.hePi 



bjiCov(xi,Xj)\ cov(xi,Xj) \ - 

2^ °jhXh 



var(x 7 ) J var(x 7 ) 



COv(Xj,Xj] 

var(xj) 



(6) 



Each of those parent variables Xh (including Xi) in Pj is a linear combination 
of external influences other than ej due to the relation of Xh to ej that Xj = 

E/^p, ^ifc^fc + = £ hePj . fyfc (Efc(t)<k(h) owet) + e j . where e * and e i are 
independent. Thus, the r\ and Xj can be rewritten as linear combinations of 
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independent external influences as follows: 



.0') 




var(xj) 




COv(Xi,Xj) 



(7) 



var(x i ) 



(8) 



The first two terms of Eq. and the first term of Eq. ([5]) are linear combinations 
of external influences other than ej, and the third term of Eq. ([7]) and the 
second term of Eq. ((SJ depend only on ej and do not depend on the other 
external influences. Further, all the external influences including ej are mutually 

independent, and the coefficient of non-Gaussian ej on and that on Xj arc 
non-zero. These imply that and Xj 8)1" G dependent since Xj and ej 

correspond to y\, y%, Sj in Darmois-Skitovitch theorem, respectively. 
From (i) and (ii), the lemma is proven. ■ 

Lemma 2 Assume that the input data x strictly follows the LiNGAM f2[). Fur- 
ther, assume that a variable Xj is exogenous. Denote byv^' a (p-1)- dimensional 

vector that collects the residuals rf when all Xi o/x are regressed on Xj (i^j)- 
Then a LiNGAM holds for the residual vector r^>': r^') = B^r* 5 ') +e^\ where 
B") is a matrix that can be permuted to be strictly lower-triangular by a simul- 
taneous row and column permutation, and elements of are non-Gaussian 
and mutually independent, rj 

Proof Without loss of generality, assume that B in the LiNGAM ([2]) is already 
permuted to be strictly lower triangular and that Xj=x\. Note that A in Eq. (0) 
is also lower triangular (although its diagonal elements are all ones). Since x\ is 
exogenous, an are equal to the regression coefficients when x\ are regressed on 
X\ (i ^ 1). Therefore, after removing the effects of x\ from x% by least squares 
estimation, one gets the first column of A to be a zero vector, and x\ does not 
affect the residuals rj . Thus, we again obtain a lower triangular mixing matrix 
A^ 1 ) with all ones in the diagonal for the residual vector r^ and hence have a 
LiNGAM for the vector r (1 \ ■ 

Corollary 1 Assume that the input data x strictly follows the LiNGAM 0). 
Further, assume that a variable Xj is exogenous. Denote by k r (j) (i) a causal 
order of rf ■ Recall that k(i) denotes a causal order of Xj. Then, the same 
ordering of the residuals is a causal ordering for the original observed variables 
as well: k r u) (l)<k r u) (m) k(l)<k(m). rj 



Proof As shown in the proof of Lemma [5J when the effect of an exogenous 
variable x\ is removed from the other observed variables, the second to p-th 
columns of A remain the same, and the submatrix of A formed by deleting 
the first row and the first column is still lower triangular. This shows that the 
ordering of the other variables is not changed and proves the corollary. ■ 

Lemma [2] indicates that the LiNGAM for the (p— l)-dimensional residual 
vector r^) can be handled as a new input model, and Lemma [T] can be fur- 
ther applied to the model to estimate the next exogenous variable (the next 
exogenous residual in fact). This process can be repeated until all variables 
are ordered, and the resulting order of the variable subscripts shows the causal 
order of the original observed variables according to Corollary [TJ 

To apply Lemma Q] in practice, we need to use a measure of independence 
which is not restricted to uncorrelatedness since least squares regression gives 
residuals always uncorrelated with but not necessarily independent of explana- 
tory variables. A common independence measure between two variables 2/1 and 
j/2 is their mutual information MJ (2/1, 2/2) [7J. I n [22] 1 a nonparametric estima- 
tor of mutual information was developed using kernel methods^ Let K\ and 
K 2 represent the Gram matrices whose elements are Gaussian kernel values of 
the sets of n observations of 2/1 and 2/2 , respectively. The Gaussian kernel values 
Ki(Vi\y[ ) and #2(2/2 ,2/2 ) («,i = 1, •■•>«) are computed by 



cxp 
cxp 



1 

'2^2 
1 



,(') 



,(j)||2 



Vi - Vi 

,W ».0')||2 



(9) 
(10) 



where <7>0 is the bandwidth of Gaussian kernel. Further let k denote a small 
positive constant. Then, in [22], the kernel-based estimator of mutual informa- 
tion is defined as: 



~t~7~ t , > 1 , det K, K 

Ml kernel \V 1,2/2) = ~ ~ log 



2 b det V K 1 



(11) 



where 



1C K 



v K = 



K 2 K X 

{Ki + W) 




K X K 2 
[K 2 + ^I) 



(12) 
(13) 



As the bandwidth a of Gaussian kernel tends to zero, the population counterpart 
of the estimator converges to the mutual information up to second order when it 
is expanded around distributions with two variables j/i and 2/2 being independent 
[22]. The determinants of the Gram matrices K\ and K 2 can be efficiently 



2 Matlab codes can be downloaded at |http: //www.di . ens . f r/~f b ach/kernel- ica/index .htm 
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computed by using the incomplete Cholesky decomposition to find their low- 
rank approximations of rank M (<#; n). In [22] , it was suggested that the positive 
constant k and the width of the Gaussian kernel a are set to k = 2 x 10 -3 , 
cr=l/2 for n > 1000 and k = 2 x 10~ 2 , a = 1 for n < 1000 due to some 
theoretical and computational considerations. 

In this paper, we use the kernel-based independence measure. We first eval- 
uate pairwise independence between a variable and each of the residuals and 
next take the sum of the pairwise measures over the residuals. Let us denote by 
U the set of the subscripts of variables Xi, i.e., U={1, ■ ■ ■ , p}. We use the fol- 
lowing statistic to evaluate independence between a variable Xj and its residuals 
r) = Xi — cov ^"- r j) x when xs is regressed on i,: 

T k ernel(Xj]U) = ^ MI kernel (Xj , r\ 0) ). (14) 

Many other nonparametric independence measures |23[[24] and more computa- 
tionally simple measures that use a single nonlinear correlation |25j have also 
been proposed. Any such proposed method of independence could potentially 
be used instead of the kernel-based measure in Eq. (fT4j) . 

3.2 DirectLiNGAM algorithm 

We now propose a new direct algorithm called DirectLiNGAM to estimate a 
causal ordering and the connection strengths in the LiNGAM ©: 

DirectLiNGAM algorithm 

1. Given a p-dimensional random vector x, a set of its variable subscripts U and 
a p x n data matrix of the random vector as X, initialize an ordered list of 
variables K := and m := 1. 

2. Repeat until p—1 subscripts are appended to K: 

(a) Perform least squares regressions of xi on Xj for all i 6 U\K (i / j) 
and compute the residual vectors and the residual data matrix R' J ' 
from the data matrix X for all j 6 U\K. Find a variable x m that is 
most independent of its residuals: 

x m = arg min T ker nei{xj;U\K), (15) 
jeu\K 

where Tkemei is the independence measure defined in Eq. (fl"4|) . 

(b) Append m to the end of K . 

(c) Let x := r( m ), X := R( m ). 

3. Append the remaining variable to the end of K. 
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4. Construct a strictly lower triangular matrix B by following the order in K, and 
estimate the connection strengths bij by using some conventional covariance- 
based regression such as least squares and maximum likelihood approaches 
on the original random vector x and the original data matrix X. We use least 
squares regression in this paper. 



3.3 Computational complexity 

Here, we consider the computational complexity of DirectLiNGAM compared 
with the ICA-LiNGAM with respect to sample size n and number of variables 
p. A dominant part of DirectLiNGAM is to compute Eq. (TH| for each Xj in 
Step 2(a). Since it requires 0(np 2 M 2 +p 3 M 3 ) operations [35] inp— 1 iterations, 
complexity of the step is 0(np 3 M 2 +p 4 M 3 ), where M (<C n) is the maximal rank 
found by the low-rank decomposition used in the kernel-based independence 
measure. Another dominant part is the regression to estimate the matrix B in 
Step |4] The complexity of many representative regressions including the least 
square algorithm is 0(np 3 ). Hence, we have a total budget of 0(np 3 M 2 +p 4 M 3 ) . 
Meanwhile, the ICA-LiNGAM requires 0(p 4 ) time to find a causal order in 
Step [5] Complexity of an iteration in FastICA procedure at Step [1] is known 
to be 0(np 2 ). Assuming a constant number C of the iterations in FastICA 
steps, the complexity of the ICA-LiNGAM is considered to be 0{Cnp 2 +p 4 ). 
Though general evaluation of the required iteration number C is difficult, it can 
be conjectured to grow linearly with regards to p. Hence the complexity of the 
ICA-LiNGAM is presumed to be 0(np 3 +p 4 ). 

Thus, the computational cost of DirectLiNGAM would be larger than that 
of ICA-LiNGAM especially when the low-rank approximation of the Gram ma- 
trices is not so efficient, i.e., M is large. However, we note the fact that Di- 
rectLiNGAM has guaranteed convergence in a fixed number of steps and is of 
known complexity, whereas for typical ICA algorithms including FastICA, the 
run-time complexity and the very convergence are not guaranteed. 



3.4 Use of prior knowledge 

Although DirectLiNGAM requires no prior knowledge on the structure, more 
efficient learning can be achieved if some prior knowledge on a part of the 
structure is available because then the number of causal orders and connection 
strengths to be estimated gets smaller. 

We present three lemmas to utilize prior knowledge in DirectLiNGAM. Let 
us first define a matrix A knw =[a^f w ] that collects prior knowledge under the 
LiNGAM © as follows: 



if Xi docs not have a directed path to x 



k„ w i 1 ii Xi has a directed path to Xj „<. 

■ —1 if no prior knowledge is available to know if either 
of the two cases above (0 or 1) is true. 
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Due to the definition of exogenous variables and that of prior knowledge 
matrix A knw , we readily obtain the following three lemmas. 

Lemma 3 Assume that the input data x strictly follows the LiNGAM @). An 
observed variable Xj is exogenous if a^f w is zero for all i^j. 

Lemma 4 Assume that the input data x strictly follows the LiNGAM Pp. An 
observed variable Xj is endogenous, i.e., not exogenous, if there exist such i^j 
that a k f w is unity. 

Lemma 5 Assume that the input data x strictly follows the LiNGAM An 
observed variable Xj does not receive the effect of Xi if a k f w is zero. 

The principle of making DircctLiNGAM algorithm more accurate and faster 
based on prior knowledge is as follows. We first find an exogenous variable 
by applying Lemma [3] instead of Lemma [JJ if an exogenous variable is identi- 
fied based on prior knowledge. Then we do not have to evaluate independence 
between any observed variable and its residuals. If no exogenous variable is 
identified based on prior knowledge, we next find endogenous (non-exogenous) 
variables by applying LemmaSJ Since endogenous variables are never exogenous 
we can narrow down the search space to find an exogenous variable based on 
Lemma [TJ We can further skip to compute the residual of an observed variable 
and take the variable itself as the residual if its regressor does not receive the 
effect of the variable due to Lemma [5] Thus, we can decrease the number of 
causal orders and connection strengths to be estimated, and it improves the 
accuracy and computational time. The principle can also be used to further an- 
alyze the residuals and find the next exogenous residual because of Corollary [TJ 
To implement these ideas, we only have to replace Step [2a] in DircctLiNGAM 
algorithm by the following steps: 

2a-l Find such a variable(s) Xj (j G U\K) that the j-th row of A knw has zero in 
the i-th column for all i G U\K (i ^ j) and denote the set of such variables 
by U exo . If U exo is not empty, set U c ■= U exo . If U exo is empty, find such 
a variable(s) Xj (j G U\K) that the j-th row of A knw has unity in the i-th 
column for at least one of i & U\K (i ^ j), denote the set of such variables 
by U end and set U c := U\K\U end . 

2a-2 Denote by V^> a set of such a variable subscript i G U\K (i ^= j) that 
fl knw _ q f or a || j £ Uc p jrst set r (i) := x . f or a || j g y(j) i next perform 

least squares regressions of x% on Xj for all i G U\K\V^ (i ^ j) and 
estimate the residual vectors rW and the residual data matrix R'^ from the 
data matrix X for all j G U c . If U c has a single variable, set the variable to 
be x m . Otherwise, find a variable x m in U c that is most independent of the 
residuals: 

x m = arg min T kernel (xj;U\K), (17) 
where Tkemei is the independence measure defined in Eq. (|14p . 
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4 Simulations 



We first randomly generated 5 datascts based on sparse networks under each 
combination of number of variables p and sample size n (j»=10, 20, 50, 100; 
n=500, 1000, 2000): 

1. We constructed the p x p adjacency matrix with all zeros and replaced 
every element in the lower-triangular part by independent realizations of 
Bernoulli random variables with success probability s similarly to |26j . 
The probability s determines the sparseness of the model. The expected 
number of adjacent variables of each variable is given by s(p — 1). We 
randomly set the sparseness s so that the number of adjacent variables 
was 2 or 5 |26j . 

2. We replaced each non-zero (unity) entry in the adjacency matrix by a value 
randomly chosen from the interval [—1.5,-0.5] U [0.5,1.5] and selected 
variances of the external influences a from the interval [1,3] as in |27j . 
We used the resulting matrix as the data-generating adjacency matrix B. 

3. We generated data with sample size n by independently drawing the ex- 
ternal influence variables ej from various 18 non-Gaussian distributions 
used in [32] including (a) Student with 3 degrees of freedom; (b) double 
exponential; (c) uniform; (d) Student with 5 degrees of freedom; (e) ex- 
ponential; (f) mixture of two double exponentials; (g)-(h)-(i) symmetric 
mixtures of two Gaussians: multimodal, transitional and unimodal; (j)- 
(k)-(l) nonsymmetric mixtures of two Gaussians, multimodal, transitional 
and unimodal; (m)-(n)-(o) symmetric mixtures of four Gaussians: multi- 
modal, transitional and unimodal; (p)-(q)-(r) nonsymmetric mixtures of 
four Gaussians: multimodal, transitional and unimodal. See Fig. 5 of [22] 
for the shapes of the probability density functions. 

4. The values of the observed variables Xi were generated according to the 
LiNGAM Finally, we randomly permuted the order of Xi. 

Further we similarly generated 5 datasets based on dense (full) networks, i.e., 
full DAGs with every pair of variables is connected by a directed edge, under 
each combination of number of variables p and sample size n. Then we tested Di- 
rectLiNGAM and ICA-LiNGAM on the datasets generated by sparse networks 
or dense (full) networks. For ICA-LiNGAM, the maximum number of iterations 
was taken as 1000 [BJ. The experiments were conducted on a standard PC using 
Matlab 7.9. Matlab implementations of the two methods are available on the 
web: 



DircctLiNGAM: http : //www . ar . sanken . osaka-u . ac . jp/~inazumi/dlingam . html 



ICA-LiNGAM: http : //www. cs .helsinki . f i/group/neuroinf /lingam/ 



We computed the distance between the true B and ones estimated by Di- 
rcctLiNGAM and ICA-LiNGAM using the Frobcnius norm defined as 



trace{(B true - B) T (B trae - B)}. (18) 
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Table 1: Median distances (Frobcnius norms) between true B and estimated B 
of DirectLiNGAM and ICA-LiNGAM with five replications. 



Sparse networks 


Sample size 
500 1000 2000 


DirectLiNGAM dim. = 10 

dim. = 20 
dim. = 50 
dim. = 100 


0.48 0.31 0.21 
1.19 0.70 0.50 
2.57 1.82 1.40 
5.75 4.61 2.35 


ICA-LiNGAM dim. = 10 

aim. = zU 
dim. = 50 
dim. = 100 


3.01 0.74 0.65 

y.Do o.UU z.UD 

20.61 20.23 12.91 
40.77 43.74 36.52 


DirectLiNGAM with dim. = 10 
prior knowledge (50%) dim. = 20 

dim. = 50 
dim. = 100 

T)pnn cp ( fnl 1 1 notwOTK^ 


0. 48 0.30 0.24 

i nn n 71 n /in 

1. UU U. ( 1 U.4t) 

2.47 1.75 1.19 
4.94 3.89 2.27 

500 1000 2000 


DirectLiNGAM dim. = 10 

dim. = 20 
dim. = 50 
dim. = 100 


0.45 0.46 0.20 
1.46 1.53 1.12 
4.40 4.57 3.86 
7.38 6.81 6.19 


ICA-LiNGAM dim. = 10 

dim. = 20 
dim. = 50 
dim. = 100 


1.71 2.08 0.39 
6.70 3.38 1.88 
17.28 16.66 12.05 
34.95 34.02 32.02 


DirectLiNGAM with dim. = 10 
prior knowledge (50%) dim. = 20 

dim. = 50 
dim. = 100 


0.45 0.31 0.19 
0.84 0.90 0.41 
2.48 1.86 1.56 
4.67 3.60 2.61 



Tables[T]and[2]sliow the median distances (Frobcnius norms) and median compu- 
tational times (CPU times) respectively. In Table [U DirectLiNGAM was better 
in distances of B and gave more accurate estimates of B than ICA-LiNGAM for 
all of the conditions. In Table [U the computation amount of DirectLiNGAM 
was rather larger than ICA-LiNGAM when the sample size was increased. A 
main bottleneck of computation was the kernel-based independence measure. 
However, its computation amount can be considered to be still tractable. In 
fact, the actual elapsed times were approximately one-quarter of their CPU 
times respectively probably because the CPU had four cores. Interestingly, the 
CPU time of ICA-LiNGAM actually decreased with increased sample size in 



14 



Table 2: Median computational times (CPU times) of DirectLiNGAM and ICA- 
LiNGAM with five replications. 



Sparse networks 


Sample size 
500 1000 2000 


DirectLiNGAM dim. = 10 

dim. = 20 
dim. = 50 
dim. = 100 


15.16 sec. 37.21 sec. 66.75 sec. 

1.56 mm. b.lb mm. ll.ZZ mm. 
16.25 min. 1.34 hrs. 2.70 hrs. 

2.35 hrs. 21.17 hrs. 19.90 hrs. 


ICA-LiNGAM dim. = 10 

J:™ on 
aim. — zU 

dim. = 50 

dim. = 100 


0.73 sec. 0.41 sec. 0.28 sec. 

o.4U sec. z.4o sec. 1.14 sec. 
14.49 sec. 21.47 sec. 32.03 sec. 
46.32 sec. 58.02 sec. 1.16 min. 


DirectLiNGAM with dim. = 10 
prior knowledge (50%) dim. = 20 

dim. = 50 
dim. = 100 

1/ C / I/J G \ 1 Llll ( llv^ L VV VJL XV O 


4.13 sec. 17.75 sec. 30.95 sec. 
Zo.UZ sec. 1.D4 mm. 4.yo mm. 

7.62 min. 28.89 min. 1.09 hrs. 
48.28 min. 1.84 hrs. 7.51 hrs. 

StftTTlTllp Q17P 

500 1000 2000 


DirectLiNGAM dim. = 10 

dim. = 20 
dim. = 50 
dim. = 100 


8.05 sec. 24.52 sec. 49.44 sec. 
1.00 min. 4.23 min. 6.91 min. 
16.18 min. 1.12 hrs. 1.92 hrs. 
2.16 hrs. 8.59 hrs. 17.24 hrs. 


ICA-LiNGAM dim. = 10 

dim. = 20 
dim. = 50 
dim. = 100 


0.97 sec. 0.34 sec. 0.27 sec. 

5.35 sec. 1.25 sec. 4.07 sec. 
15.58 sec. 21.01 sec. 31.57 sec. 
47.60 sec. 56.57 sec. 1.36 min. 


DirectLiNGAM with dim. = 10 
prior knowledge (50%) dim. = 20 

dim. = 50 
dim. = 100 


2.67 sec. 5.66 sec. 12.31 sec. 
5.02 sec. 31.70 sec. 38.35 sec. 
46.74 sec. 2.89 min. 5.00 min. 
3.19 min. 10.44 min. 19.80 min. 



some cases. This is presumably due to better convergence properties. 

To visualize the estimation results, Figures [I] [21 and @] give combined scat- 
terplots of the estimated elements of B of DirectLiNGAM and ICA-LiNGAM 
versus the true ones for sparse networks and dense (full) networks respectively. 
The different plots correspond to different numbers of variables and different 
sample sizes, where each plot combines the data for different adjacency matri- 
ces B and 18 different distributions of the external influences p(e,). We can see 
that DirectLiNGAM worked well and better than ICA-LiNGAM, as evidenced 
by the grouping of the data points onto the main diagonal. 
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Figure 1: Scattcrplots of the estimated bij by DirectLiNGAM versus the true 
values for sparse networks. 
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Figure 2: Scattcrplots of the estimated bij by ICA-LiNGAM versus the true 
values for sparse networks. 



Finally, we generated datascts in the same manner as above and gave some 
prior knowledge to DirectLiNGAM by creating prior knowledge matrices A knw 
as follows. We first replaced every non-zero element by unity and every diagonal 
element by zero in A=(I — B)^ 1 and subsequently hid each of the off-diagonal 
elements, i.e., replaced it by —1, with probability 0.5. The bottoms of Tables[TJ 
and [5] show the median distances and median computational times. It was em- 
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Figure 3: Scattcrplots of the estimated by DirectLiNGAM versus the true 
values for dense (full) networks. 
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Figure 4: Scatterplots of the estimated bij by ICA-LiNGAM versus the true 
values for dense (full) networks. 

pirically confirmed that use of prior knowledge gave more accurate estimates 
and less computational times in most cases especially for dense (full) networks. 
The reason would probably be that for dense (full) networks more prior knowl- 
edge about where directed paths exist were likely to be given and it narrowed 
down the search space more efficiently. 
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Figure 5: Abstract model of the double-pendulum used in [29] . 

5 Applications to real-world data 

We here apply DirectLiNGAM and ICA-LiNGAM on real-world physics and 
sociology data. Both DirectLiNGAM and ICA-LiNGAM estimate a causal or- 
dering of variables and provide a full DAG. Then we have two options to do 
further analysis [9] : i) Find significant directed edges or direct causal effects bij 
and significant total causal effects with A=(I — B) _1 ; ii) Estimate redundant 
directed edges to find the underlying DAG. We demonstrate an example of the 
former in Subsection 15.11 and that of the latter in Subsection 15.21 

5.1 Application to physical data 

We applied DirectLiNGAM and ICA-LiNGAM on a dataset created from a 
physical system called a double-pendulum, a pendulum with another pendulum 
attached to its end [28] as in Fig. [5] The dataset was first used in [29] . The raw 
data consisted of four time series provided by Ibaraki University (Japan) filming 
the pendulum system with a high-speed video camera at every 0.01 second for 
20.3 seconds and then reading out the position using an image analysis software. 
The four variables were 9\: the angle between the top limb and the vertical, #2 : 
the angle between the bottom limb and the vertical, u)\: the angular speed of 
Q\ or 9 1 and u>2- the angular speed of O2 or 62- The number of time points was 
2035. The dataset is available on the web: 

http : / / www . ar . sanken . osaka-u . ac . j p/~inazumi/dat a/ f ur iko . html 

In 29 . some theoretical considerations based on the domain knowledge im- 
plied that the angle speeds uj\ and u>2 are mainly determined by the angles 
9\ and 62 in both cases where the swing of the pendulum is sufficiently small 
(#i,#2 ^ 0) and where the swing is not very small. Further, in practice, it was 
reasonable to assume that there were no latent confoundcrs [29]. 

As a preprocessing, we first removed the time dependency from the raw 
data using the ARMA (AutoRegressive Moving Average) model with 2 autore- 
gressive terms and 5 moving average terms following [29] , Then we applied 
DirectLiNGAM and ICA-LiNGAM on the preprocessed data. The estimated 
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Figure 6: Left: The estimated network by DirectLiNGAM. Only significant 
directed edges are shown with 5% significance level. Right: The estimated 
network by ICA-LiNGAM. No significant directed edges were found with 5% 
significance level. 




PC GES 



Figure 7: Left: The estimated network by PC algorithm with 5% significance 
level. Right: The estimated network by GES. An undirected edge between two 
variables means that there is a directed edge from a variable to the other or the 
reverse. 



adjacency matrices B of 9\, 62, oj\ and u>2 were as follows: 



DirectLiNGAM 



ICA - LiNGAM 
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O2 






9i 
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°\ 
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1.45 











UJl 


108.82 


-52.73 








LU2 


^216.26 


112.50 


-1.89 





(19) 



(20) 



The estimated orderings by DirectLiNGAM and ICA-LiNGAM were identical, 
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but the estimated connection strengths were very different. We further com- 
puted their 95% confidence intervals by using bootstrapping [30] with the num- 
ber of bootstrap replicates 10000. The estimated networks by DircctLiNGAM 
and ICA-LiNGAM are graphically shown in Fig. |H1 where only significant di- 
rected edges (direct causal effects) 6y are shown with 5% significance level0 
DirectLiNGAM found that the angle speeds uj\ and 0J2 were determined by the 
angles 8\ or 82, which was consistent with the domain knowledge. Though the 
directed edge from 61 to 8 2 might be a bit difficult to interpret, the effect of 8\ 
on 02 was estimated to be negligible since the coefficient of determination [I] 
of 82, i.e., 1— var(e2)/var(#2), was very small and was 0.01. (The coefficient of 
determination of cji and that of u>2 were 0.46 and 0.49 respectively.) On the 
other hand, ICA-LiNGAM could not find any significant directed edges since it 
gave very different estimates for different bootstrap samples. 

For further comparison, we also tested two conventional methods [3TII32"] 
based on conditional independences. Fig. [7] shows the estimated networks by 
PC algorithm [3T] with 5% significance level and GES [33] with the Gaussianity 
assumption. We used the Tetrad IV0 to run the two methods. PC algorithm 
found the same directed edge from 8\ on uj\ as DircctLiNGAM did but did not 
found the directed edge from 6> 2 on uj 2 ■ GES found the same directed edge from 
6\ on 82 as DircctLiNGAM did but did not find that the angle speeds wi and 
ui2 were determined by the angles 8\ or 82- 

We also computed the 95% confidence intervals of the total causal effects ab- 
using bootstrap. DircctLiNGAM found significant total causal effects from 8\ 
on 82, from 8\ on u>i, from 8\ on lu 2 , from 82 on oj\, and from 8 2 on uj 2 . These 
significant total effects would also be reasonable based on similar arguments. 
ICA-LiNGAM only found a significant total causal effect from 82 on uj 2 - 

Overall, although the four variables 8\, 8 2 , oj\ and ui 2 are likely to be nonlin- 
early related according to the domain knowledge [2HI2H], DirectLiNGAM gave 
interesting results in this example. 

5.2 Application to sociology data 

We analyzed a dataset taken from a sociological data repository on the Inter- 
net called General Social Survey ( |http : / / www . nor c . org/GSS+ Website/ ). The 
data consisted of six observed variables, x\: father's occupation level, 22: son's 
income, 23: father's education, 24: son's occupation level, 25: son's educa- 
tion, xq: number of siblings. The sample selection was conducted based on the 
following criteria: i) non-farm background (based on two measures of father's 
occupation); ii) ages 35 to 44; hi) white; iv) male; v) in the labor force at the 
time of the survey; vi) not missing data for any of the covariates; vii) years 
1972-2006. The sample size was 1380. Fig. [8] shows domain knowledge about 
their causal relations [33]. As shown in the figure, there could be some latent 
confounders between x\ and 23, x\ and 26, or 23 and x§. An objective of this 

3 The issue of multiple comparisons arises in this context, which we would like to study in 
future work. 

4 http: //www.phil . emu. edu/projects/tetrad/ 
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Figure 8: Status attainment model based on domain knowledge |33j . A bi- 
directed edge between two variables means that the relation is not modeled. 
For instance, there could be latent confounders between the two, there could be 
a directed edge between the two, or the two could be independent. 



example was to see how our method behaves when such a model assumption of 
LiNGAM could be violated that there is no latent confounder. 

The estimated adjacency matrices B by DirectLiNGAM and ICA-LiNGAM 
were as follows: 
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(21) 



(22) 



We subsequently pruned redundant directed edges bij in the full DAGs by 
repeatedly applying a sparse method called Adaptive Lasso [31] on each variable 
and its potential parents. Sec Appendix A for some more details of Adaptive 
Lasso. We used a matlab implementation in |35j to run the Lasso. Then we 
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obtained the following pruned adjacency matrices B: 
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(24) 



The estimated networks by DirectLiNGAM and ICA-LiNGAM are graphi- 
cally shown in Fig. |H] and Fig. [TU] respectively. All the directed edges estimated 
by DirectLiNGAM were reasonable to the domain knowledge other than the 
directed edge from x§: son's education to X3: father's education. Since the 
sample size was large and yet the estimated model was not fully correct, the 
mistake on the directed edge between x$ and X3 might imply that some model 
assumptions might be more or less violated in the data. ICA-LiNGAM gave a 
similar estimated network but did one more mistake that xe- number of siblings 
is determined by x§: son's education. 

Further, Fig. [TT1 and Fig. [T2lshow the estimated networks by PC algorithm 
with 5% significance level and GES with the Gaussianity assumption. Both 
of the conventional methods did not find the directions of many edges. The 
two conventional methods found a reasonable direction of the edge between x\: 
father's occupation and X3: father's education, but they gave a wrong direction 
of the edge between x±: father's occupation and X4: son's occupation. 



6 Conclusion 

We presented a new estimation algorithm for the LiNGAM that has guaranteed 
convergence to the right solution in a fixed number of steps if the data strictly 
follows the model and known computational complexity unlike most ICA meth- 
ods. This is the first algorithm specialized to estimate the LiNGAM. Simulations 
implied that the new method often provides better statistical performance than 
a state of the art method based on ICA. In real-world applications to physics 
and sociology, promising results were obtained. Future works would include i) 
assessment of practical performance of statistical tests to detect violations of 
the model assumptions including tests of independence [36] ; ii) implementation 
issues of our algorithm to improve the practical computational efficiency. 
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Figure 9: The estimated network by DirectLiNGAM and Adaptive Lasso, 
red solid directed edge is reasonable to the domain knowledge. 
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Figure 10: The estimated network by ICA-LiNGAM and Adaptive Lasso, 
red solid directed edge is reasonable to the domain knowledge. 
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Figure 11: The estimated network by PC algorithm with 5% significance level. 
An undirected edge between two variables means that there is a directed edge 
from a variable to the other or the reverse. A red solid directed edge is reasonable 
to the domain knowledge. 
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Figure 12: The estimated network by GES. An undirected edge between two 
variables means that there is a directed edge from a variable to the other or the 
reverse. A red solid directed edge is reasonable to the domain knowledge. 



A Adaptive Lasso 

We very briefly review the adaptive Lasso [34], which is a variant of the Lasso 
[37] . See [34] for more details. The adaptive Lasso is a rcgularization technique 
for variable selection and assumes the same data generating process as LiNGAM: 

Xi = bijXj + e l . (25) 

k(j)<k(i) 
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A big difference is that the adaptive Lasso assumes that the set of such potential 
parent variables Xj that k(j)<k(i) is known and LiNGAM estimates the set of 
such variables. The adaptive Lasso penalizes connection strengths bij in L\ 
penalty by minimizing the objective function defined as: 



where A and 7 are tuning parameters and b^ is a consistent estimate of bij . In 
[3~i] , it was suggested to select the tuning parameters by five- fold cross validation 
and to obtain bij by ordinary least squares regression. The adaptive Lasso has 
a very attractive property that it asymptotically selects the right set of such 
variables Xj that bij is not zero, where k(j)<k(i). 
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